IO: Fix parquet read from s3 directory #33632


Merged
merged 27 commits into pandas-dev:master, Apr 26, 2020

Conversation

@alimcmaster1 (Member) commented Apr 18, 2020

(Seems to have also fixed the xfailing test in #33077)

NOTE: let's merge #33645 first, since that fixes up a crucial bit of error handling around this functionality.

@alimcmaster1 alimcmaster1 added the IO Parquet parquet, feather label Apr 18, 2020
@jorisvandenbossche (Member) left a comment

Thanks for looking into this!

path.close()

parquet_ds = self.api.parquet.ParquetDataset(
path, filesystem=get_fs_for_path(path), **kwargs
Member

Is this filesystem=get_fs_for_path(path) needed? What happens if you just pass the path? (which I assume has e.g. an s3://... in it?)

Member Author

pyarrow seems to only allow a file path, as opposed to a dir path. Removing the filesystem arg here throws:

            for path in path_or_paths:
                if not fs.isfile(path):
                    raise IOError('Passed non-file path: {0}'
>                                 .format(path))
E                   OSError: Passed non-file path: s3://pandas-test/parquet_dir

../../../.conda/envs/pandas-dev/lib/python3.7/site-packages/pyarrow/parquet.py:1229: OSError

To reproduce, see the test case test_s3_roundtrip_for_dir I wrote below.

Member

Ah, OK. I see now in pyarrow that apparently string URIs with "s3://..." are not supported (while "hdfs://" is supported). That's something we should fix on the pyarrow side as well. But of course until then this is fine.
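Until pyarrow handles "s3://..." URIs natively, pandas has to detect the scheme itself and hand pyarrow an explicit filesystem. The detection step can be sketched with just the standard library (infer_scheme is a hypothetical name here, not the actual pandas helper):

```python
from urllib.parse import urlparse


def infer_scheme(filepath: str) -> str:
    # "s3://bucket/key" -> "s3"; plain local paths yield ""
    return urlparse(filepath).scheme


infer_scheme("s3://pandas-test/parquet_dir")  # "s3"
infer_scheme("/tmp/data.parquet")             # ""
```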

@jreback (Contributor) commented Apr 21, 2020

can you rebase

@@ -92,8 +97,7 @@ def write(
**kwargs,
):
self.validate_dataframe(df)
-        path, _, _, _ = get_filepath_or_buffer(path, mode="wb")
-
+        file_obj, _, _, _ = get_filepath_or_buffer(path, mode="wb")
Member Author

@jorisvandenbossche I think we can clean up the write method here to get rid of get_filepath_or_buffer, similar to what I've done below for read. Will address in a different PR.

@alimcmaster1 (Member Author)

> can you rebase

Sure, merged + green.

@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Apr 22, 2020
@@ -92,7 +97,7 @@ def write(
**kwargs,
):
self.validate_dataframe(df)
-        path, _, _, should_close = get_filepath_or_buffer(path, mode="wb")
+        file_obj, _, _, should_close = get_filepath_or_buffer(path, mode="wb")
Member

you didn't change path to file_obj in the if partition_cols is not None: block. Was that on purpose?

Member Author

That was indeed on purpose: write_to_dataset doesn't support a file-like object when the path is a directory.

import pyarrow.parquet
import pandas as pd
from pandas.io.common import get_filepath_or_buffer

path = "s3://pandas-test/dev"
file_obj, _, _, _ = get_filepath_or_buffer(path)
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
table = pyarrow.Table.from_pandas(df)

# Works
pyarrow.parquet.write_to_dataset(table, path, partition_cols=["a"])

# Throws AttributeError: 'NoneType' object has no attribute '_isfilestore'
pyarrow.parquet.write_to_dataset(table, file_obj, partition_cols=["a"])
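The rule can be sketched like this (choose_write_target is a hypothetical name, purely to illustrate why the partition_cols branch keeps the raw path):

```python
def choose_write_target(path, file_obj_or_path, partition_cols):
    if partition_cols is not None:
        # write_to_dataset creates a directory tree of partition files,
        # so it needs the raw path string, not an opened file-like object
        return path
    # single-file writes can go through the opened object
    return file_obj_or_path


choose_write_target("s3://pandas-test/dev", "FILE_OBJ", ["a"])  # -> the path
```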

@jorisvandenbossche (Member)

OK, can you add some clarifying comments for it?

Member Author

Sure done!

Contributor

is this always a file_obj, never a path? e.g. should we rename to filepath_or_buffer?

Member Author

Good point, I've renamed it to file_obj_or_path, since when a local path is passed in, a path str is returned.

Add clarifying comment
@pep8speaks

pep8speaks commented Apr 25, 2020

Hello @alimcmaster1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-26 20:58:49 UTC

@@ -150,6 +150,23 @@ def urlopen(*args, **kwargs):
return urllib.request.urlopen(*args, **kwargs)


def get_fs_for_path(filepath):
Contributor

can you type this (and the return annotation)

Member Author

Left the return type off for now, since it includes optional dependencies,

e.g. Union[s3fs.S3FileSystem, gcsfs.GCSFileSystem, None]

Can add imports to the TYPE_CHECKING block at the top if that's appropriate?

def get_fs_for_path(filepath):
"""
Get appropriate filesystem given a filepath.
Support s3fs, gcs and local disk fs
Contributor

can you make this a full doc-string (Parameters / Returns)

Member Author

Sure done :)
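Putting the typing and docstring suggestions together, the helper could look roughly like this; a sketch that assumes scheme sniffing via urllib.parse, not the exact pandas implementation. The optional imports stay inside the branches so neither s3fs nor gcsfs is required up front, and the TYPE_CHECKING block makes the string annotation resolvable for type checkers:

```python
from typing import TYPE_CHECKING, Optional, Union
from urllib.parse import urlparse

if TYPE_CHECKING:
    import gcsfs
    import s3fs


def get_fs_for_path(
    filepath: str,
) -> "Optional[Union[s3fs.S3FileSystem, gcsfs.GCSFileSystem]]":
    """
    Get the appropriate filesystem for a given filepath.

    Supports s3fs, gcsfs and local disk.

    Parameters
    ----------
    filepath : str
        File path, e.g. "s3://bucket/key" or a local path.

    Returns
    -------
    s3fs.S3FileSystem, gcsfs.GCSFileSystem or None
        A filesystem object for s3/gcs paths, None for local disk.
    """
    scheme = urlparse(filepath).scheme
    if scheme == "s3":
        import s3fs

        return s3fs.S3FileSystem()
    if scheme in ("gcs", "gs"):
        import gcsfs

        return gcsfs.GCSFileSystem()
    return None
```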


@jreback (Contributor) left a comment

doc-string comment + need to merge master

@@ -585,6 +585,8 @@ I/O
unsupported HDF file (:issue:`9539`)
- Bug in :meth:`~DataFrame.to_parquet` was not raising ``PermissionError`` when writing to a private s3 bucket with invalid creds. (:issue:`27679`)
- Bug in :meth:`~DataFrame.to_csv` was silently failing when writing to an invalid s3 bucket. (:issue:`32486`)
- :func:`read_parquet` now supports an s3 directory (:issue:`26388`)
Contributor

can you review the doc-strings to see if they need updating (e.g. may need a versionadded tag)

Member Author

The parquet docstrings indicate we already supported this, I think? I updated the whatsnew and added an example in the docstrings.

@jreback jreback merged commit 22cf0f5 into pandas-dev:master Apr 26, 2020
@jreback (Contributor) commented Apr 26, 2020

thanks @alimcmaster1

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request May 14, 2020
simonjayhawkins added a commit that referenced this pull request May 14, 2020
@simonjayhawkins simonjayhawkins modified the milestones: 1.1, 1.0.4 May 26, 2020
if should_close:
path.close()

parquet_ds = self.api.parquet.ParquetDataset(


@alimcmaster1

This change breaks clients that pass a file-like object for path. ParquetDataset doesn't provide the same file-like object handling that the original get_filepath_or_buffer did.

Here's the call stack I'm seeing:

.tox/test/lib/python3.7/site-packages/pandas/io/parquet.py:315: in read_parquet
    return impl.read(path, columns=columns, **kwargs)
.tox/test/lib/python3.7/site-packages/pandas/io/parquet.py:131: in read
    path, filesystem=get_fs_for_path(path), **kwargs
.tox/test/lib/python3.7/site-packages/pyarrow/parquet.py:1162: in __init__
    self.paths = _parse_uri(path_or_paths)
.tox/test/lib/python3.7/site-packages/pyarrow/parquet.py:47: in _parse_uri
    path = _stringify_path(path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _


I filed bug report #34467
